The Principal Discriminant Method (PDM) of prediction employs a novel combination of principal
component analysis and statistical discriminant analysis. Discriminant analysis is based on the
construction of discrete category subsets of predictor values in a multidimensional predictor space.
A category subset contains those predictor values which give rise to a predictand (or observation)
in that particular category. A new predictor value is then assigned to a particular category (i.e.,
a forecast is made) through the use of probability distribution functions which have been fitted to
the category subsets. The PDM uses principal component analysis to define the multidimensional
probability distribution functions associated with the category subsets. Because of its underlying
discriminant nature the PDM is also applicable to problems in data classification. The PDM is
applied to prediction problems using both artificial and actual data sets. When applied to
artificial data the PDM shows forecast skills which are comparable to those of standard forecast
techniques, such as linear regression and classical discriminant analysis. When applied to actual
data in a forecast of the 1982-1983 El Niño, the PDM performed poorly. However, in forecasting
winter air temperatures over North America, the PDM proved superior to other forecast techniques,
after suitably filtering or smoothing the raw data in order to improve the signal-to-noise ratio.
It is expected that the PDM will show its greatest advantage over other forecast techniques when
the relation between predictors and predictand is nonlinear.
1. Introduction

Discriminant methods in general, and the Principal Discriminant Method (PDM) in particular,
can be applied to forecasting problems in which it is desired to forecast a discrete state of
the atmosphere or ocean. An example is the forecasting of seasonal temperatures as one of the
three discrete states “above average,” “average,” or “below average.” Because of its underlying
discriminant nature the PDM can also be used in data classification. An example is the
assignment of the observed state of the atmosphere to one of several discrete “climate types.”
A further application of the PDM is the linking of the output of a general circulation model
(GCM) of the atmosphere with observed fields in order to produce model-output statistic (MOS)
schemes of prediction. Our description of the PDM shows its essential form, so as to facilitate
applications to any of the problems just mentioned.

The successful construction of category subsets in a multidimensional predictor space is a
sine qua non of any discriminant method, along with the fitting of versatile probability
density functions (pdf's) to these subsets. The modifier “principal” in the name of the present
method derives from the fact that for multiple predictors, essential use is made of principal
component analysis (PCA) in order to determine appropriate probability density functions for
the category subsets. This is the major difference between the PDM and standard discriminant
analyses. In effect, it allows for irregular distributions of prediction data that consequently
do not fit well-known pdf's, these pdf's being the heart of any discriminant method.

Another unique feature of the PDM is that of self-evaluation of predictive skill. This is
supplied by three indices of skill: the potential predictability, the potential 0-class error,
and the potential 1-class error in the predictand categories. These indices, along with their
critical values supplied by a Monte Carlo technique, help the user to decide how much
confidence to place in a given prediction made by the PDM. Also, during the construction of the
PDM's working parts, provision is made to test the method on an independent data set. This
testing gives another indication of how well a data set is constituted to allow predictions of
its variables' future states.

The exposition of the PDM, which is the main goal of this paper, will be made in two parts.
The first part (section 2) treats the case of a single predictor, in which case the PDM reduces
to a classical discriminant method. In real applications the single-predictor mode can yield
much information about the potential predictability of a predictand by a given predictor, along
with some information about the skill of the predictions. The single-predictor mode of the PDM
can therefore stand as an independent, preliminary prediction method. The second part
(section 3) treats the case of multiple predictors. It is expected that the predictability will
increase when a single predictor is joined by several more predictors and when the category
subsets in the resultant multidimensional predictor space can be carved out of the swarm of
data points there. It is in this mode that the PDM realizes its full power, via its application
of principal component analysis to the multidimensional swarm of data points.

Fig. 1. Illustration of a predictor-predictand pair and a tercile categorization. (a) A
standardized predictor time series X(j, k), j = 1, ..., N = 21, where k is fixed. (b) The
corresponding time series of the predictand values, Y'(j); boundary values B_1 and B_2 are
indicated. (c) The terciled values of the predictand, Y(j).

Section  4  discusses  the  results  of  using  the  PDM  in  various 
forecast  situations.  This  rather  brief  discussion  is  intended  to 
highlight  some  of  the  strengths  and  weaknesses  of  the  PDM,  a 
goal  in  concert  with  the  theoretical  nature  of  the  rest  of  the 
paper. 

This paper is a condensation of a technical memorandum [Preisendorfer et al., 1987], which
can be consulted for a more detailed presentation, especially of the results discussed in
section 4. The reader desiring an elementary discussion of discriminant analysis in its
conventional statistical formulation is referred to Lachenbruch [1975]. Applications of the
discriminant method in climate forecasting are given by Harnack et al. [1985]. For a discussion
of principal component analysis, see Preisendorfer [1988].

2.  The  Single-Predictor  Stage 
It  is  assumed  that  we  have  available  a  data  set  consisting  of 
simultaneous  observations  of  both  predictors  and  predictands. 
Such  a  data  set  is  required  in  order  to  construct  the  PDM 
model.  After  the  model  has  been  constructed,  it  is  capable  of 
making  forecasts  when  given  new  predictor  values. 

2.1.  The  Predictor-Predictand  Pair 
Let X(j, k) denote the value of the kth predictor X at time j. It is convenient to standardize
the predictor in time, so that the time series X(j, k), j = 1, 2, ..., N, has zero mean and unit
variance for each k, k = 1, 2, ..., K. Let Y'(j) denote the value of the predictand Y' at the
same time j. For example, in a model-output statistic setting, the various predictors X(j, k)
might be the sea surface temperature (k = 1), the sea level pressure (k = 2), the relative
humidity (k = 3), etc., all at the same spatial location, and a particular predictand Y'(j)
might be the horizontal visibility at the same time j and at the same or a different location.

In order to use the predictive capabilities of the PDM, we introduce a time lag τ into Y'(j),
so as to pair Y'(j + τ) with X(j, k), τ > 0. For simplicity it will be assumed that τ has been
introduced into Y'(j), and we will retain the notation X(j, k) and Y'(j) for the lagged
predictor-predictand pair, where now j = 1, 2, ..., N labels the common range of times of the
lagged pair. Hereafter it will be assumed that each predictor-predictand datum pair is
statistically independent from other members of the data set. This condition can be tested and
the original data suitably redefined to ensure independence if necessary. Several of the
methods to be discussed later require this property of the data.
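The standardization and lagging described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `standardize` and `lagged_pairs` are hypothetical, the sample values are made up, and the population standard deviation (division by N) is assumed because the text does not say which variance estimator is used.

```python
from statistics import fmean, pstdev


def standardize(series):
    """Rescale a time series to zero mean and unit variance."""
    mean = fmean(series)
    std = pstdev(series)  # assumption: population standard deviation (divides by N)
    return [(value - mean) / std for value in series]


def lagged_pairs(x, y_raw, lag):
    """Pair X(j) with Y'(j + lag) over the common range of times."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return list(zip(x[:-lag], y_raw[lag:]))


# Made-up predictor and predictand series, for illustration only.
x = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y_raw = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
pairs = lagged_pairs(x, y_raw, lag=2)
```

Note that a lag of τ shortens the usable record by τ points, which is why N is redefined as the length of the common range.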

2.2.  Q-tiling  the  Predictand 

Divide the range of predictand values {Y'(j): j = 1, ..., N} into Q intervals. By judicious
choice of the boundary values B_1, B_2, ..., B_{Q-1} between these intervals, we can “Q-tile”
the predictand Y'(j) into Q discrete categories. Let Y(j) denote the value of the discrete
category to which Y'(j) belongs; thus Y(j) = q if Y'(j) falls into category q, 1 ≤ q ≤ Q.
Figure 1 illustrates these ideas for the case of Q = 3, called a “tercile categorization.” In
Figure 1 we define Y(j) as follows:

$$
Y(j) =
\begin{cases}
1 & \text{if } Y'(j) < B_1 \\
2 & \text{if } B_1 \le Y'(j) < B_2 \\
3 & \text{if } B_2 \le Y'(j)
\end{cases}
$$

for j = 1, ..., N. There is no requirement that the boundary values be equally spaced or that
the Q categories be equally populated after the Q-tiling of the predictand.
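A minimal sketch of the Q-tiling step follows. The function name `q_tile` and the boundary values are illustrative assumptions; the only requirement taken from the text is that a value equal to a boundary falls into the higher category.

```python
from bisect import bisect_right


def q_tile(y_raw, boundaries):
    """Map each Y'(j) to a category 1..Q, given sorted boundaries B_1..B_{Q-1}.

    bisect_right counts the boundaries that are <= y, so a value equal to
    B_1 lands in category 2, matching B_1 <= Y'(j) < B_2.
    """
    return [bisect_right(boundaries, y) + 1 for y in y_raw]


# Hypothetical tercile boundaries and predictand values.
boundaries = [-0.5, 0.5]
categories = q_tile([-1.2, -0.5, 0.1, 0.5, 2.0], boundaries)
```

Because the boundaries are passed in explicitly, unequal spacing and unequally populated categories are handled without any change to the function.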

2.3.  The  Discriminant  Set 

The time series for the kth predictor X(j, k) (Figure 1a) and the Q-tiled predictand Y(j)
(Figure 1c) can be combined to form a single diagram, called the discriminant diagram. Figure 2
shows the discriminant diagram corresponding to Figure 1. In this example one sees at a glance
that large, positive predictor values tend to be associated with terciled predictand values in
category 1, predictor values near zero are associated with category 2 predictand values, and
large, negative predictor values tend to correspond to predictand values in category 3. The
discriminant set consists of the N points [X(j, k), Y(j)], j = 1, 2, ..., N.

Fig. 2. The discriminant diagram corresponding to Figure 1a and Figure 1c, where k is held
fixed as j runs from 1 to N = 21. The horizontal axis is X(j, k), running from -2 to 2.
2.4. Training and Testing Sets

The discriminant set of N points is randomly split into two subsets of predetermined sizes,
N_tr and N_ts. The subset containing N_tr points is the training set, and the subset containing
N_ts points is the testing set. Typically, we choose N_tr = 2N_ts so that two thirds of the N
available data points can be used to “train,” or to construct, the PDM; one third of the points
can be used to “test,” or to score, the PDM. Figure 3 shows a possible partition of the points
of Figure 2 into training and testing sets. Let X_tr(i, k), i = 1, 2, ..., N_tr, denote those
values of X(j, k) which fall into the training set. Likewise, let Y_tr(i), i = 1, ..., N_tr,
denote the corresponding values of Y(j). Those points of the discriminant set which have been
randomly assigned to the testing set are denoted by [X_ts(i, k), Y_ts(i)], i = 1, 2, ..., N_ts.
In order to fully utilize the training-testing set partition philosophy, it is necessary that
the “training” data be statistically independent from the “testing” data. This is a critical
factor in our procedure, and henceforth we assume that independence has been established
(compare section 2.1).
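The random two-to-one partition can be sketched as below. The function name `split_discriminant_set`, the seed argument, and the synthetic points are assumptions made for the example; the text only specifies that the split is random and of predetermined sizes.

```python
import random


def split_discriminant_set(points, n_train, seed=None):
    """Randomly split (predictor, category) points into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(points)  # copy, so the caller's ordering is left intact
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 21 synthetic points, as in the N = 21 example; two thirds go to training.
points = [(0.1 * j, 1 + j % 3) for j in range(21)]
training, testing = split_discriminant_set(points, n_train=14, seed=0)
```

Calling the function with different seeds gives the different random partitions over which scores can later be averaged.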

2.5. Category Subsets of Predictor Space

The subset of predictor points in the training set which is associated with category q of
predictand values is termed the qth category subset of the predictor space, denoted by C_q,
q = 1, 2, ..., Q, whose elements are C_q(i), i = 1, 2, ..., M_q. Figure 3 shows the three
category subsets for the illustrated training set: C_1 with M_1 = 3, C_2 with M_2 = 6, and C_3
with M_3 = 5. The category subsets form the heart of the discriminant structure of the PDM.

2.6. Fitting the Probability Density Functions

Once the category subsets of predictor points have been obtained, any discriminant method,
including the PDM, requires the fitting of probability density functions to these category
subsets. A decisive point in the discriminant method can arise when choosing the specific form
of the probability density function to be fitted to the category subsets. We choose the
Gaussian distribution for this exposition, although it may be worthwhile in other applications
to use a pdf specifically tailored to a given data set. The form of the Gaussian pdf for
category q is

$$
\phi_q(X) = \left(2\pi\sigma_q^2\right)^{-1/2} \exp\left[-\frac{(X - \bar{X}_q)^2}{2\sigma_q^2}\right]
$$

where \bar{X}_q is the average over i of the qth category subset {C_q(i): i = 1, ..., M_q} and
σ_q² is the variance of this set of points.

Fig. 3. A partitioning of the discriminant set shown in Figure 2 into (a) a training set and
(b) a testing set. The category subsets of the training set are indicated.

Fig. 4. The pdf's φ_1(X), φ_2(X), and φ_3(X) for the category subsets of Figure 3a.

Note that although the original data set X(j, k), j = 1, ..., N, was standardized to zero mean
and unit variance, the category subsets C_q in general have nonzero means and nonunit
variances. Figure 4 shows the fitted Gaussian pdf's, φ_1(X), φ_2(X), and φ_3(X), for the
category subsets of Figure 3a. Once the φ_q(X), q = 1, ..., Q, have been determined, the
construction (or training) of the single-predictor PDM model is complete. Observe that implicit
in the φ_q(X) is the fact that they were constructed for a particular realization of the
training set. A different partition of the discriminant set into training and testing sets
would yield somewhat different φ_q(X) functions.
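Forming the category subsets and fitting a Gaussian to each can be sketched as follows. The names `fit_category_pdfs` and `gaussian_pdf` and the training points are hypothetical, and the population variance is assumed since the estimator is not specified in the text. A subset with a single point would give zero variance and would need separate handling.

```python
from math import exp, pi, sqrt
from statistics import fmean, pvariance


def fit_category_pdfs(training):
    """Return {q: (mean, variance)} for each category subset C_q."""
    subsets = {}
    for x, q in training:
        subsets.setdefault(q, []).append(x)
    # Assumption: population variance (divides by M_q).
    return {q: (fmean(c), pvariance(c)) for q, c in subsets.items()}


def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


# Made-up training set of (predictor value, category) pairs.
training = [(1.2, 1), (0.8, 1), (1.6, 1),
            (0.1, 2), (-0.2, 2), (0.3, 2), (0.2, 2),
            (-1.0, 3), (-1.4, 3), (-0.6, 3)]
params = fit_category_pdfs(training)
```

The dictionary `params` is all that needs to be stored for the single-predictor model, since each fitted pdf is determined by its mean and variance.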

2.7.  Making  a  Prediction 

Suppose a new predictor realization X' occurs for predictor k; i.e., X' = X(j, k) for some
time j. We wish to use the PDM model constructed earlier in order to make a predictand forecast
for the new predictor value X'. Various strategies can be adopted regarding the manner in which
the pdf's φ_q(X) are employed in making a forecast. Two of the more obvious are discussed in the
following subsections.

2.7.1. Maximum probability strategy. Given a predictor value X', we compute φ_q(X') for each
category q = 1, ..., Q and note which q value, call it q', has the maximum pdf value. The
prediction is then that Y(j) = q'. In Figure 4 we see, for example, that X' = -0.5 would yield
a prediction of Y in category 3, X' = 0.0 would predict Y = 2, and so on.
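The maximum probability strategy amounts to an argmax over the fitted pdf's. The sketch below assumes Gaussian pdf's stored as (mean, variance) pairs; the parameter values are invented for illustration and are not those of Figure 4, and `predict_max_probability` is a hypothetical name.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def predict_max_probability(x_new, params):
    """Return the category q' whose fitted pdf is largest at x_new."""
    return max(params, key=lambda q: gaussian_pdf(x_new, *params[q]))


# Hypothetical fitted (mean, variance) for categories 1, 2, 3.
params = {1: (1.2, 0.25), 2: (0.1, 0.16), 3: (-1.0, 0.25)}
forecasts = [predict_max_probability(x, params) for x in (-1.0, 0.0, 1.5)]
```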

2.7.2. Bayesian strategy. The maximum probability strategy is easily interpreted and
computationally simple; however, it may not make the best use of the available information. The
method of Bayesian inference is perhaps better suited to the problem at hand.

Strictly speaking, the φ_q(X) pdf's relate to conditional probabilities: namely, φ_q(X) gives
the pdf of X, given that category q is observed. To fix this idea, let us write
φ(X | q) ≡ φ_q(X). What we really need in order to make a forecast is the probability that
category q occurs, given that a specific value of X occurs; let us denote this by P(q | X). The
category, call it q', with the greatest probability P(q | X) for the given value of X = X' is
then the category forecasted by the PDM when X' is observed. Since the Q predictand categories
are mutually exclusive and exhaustive, Bayes' theorem,

$$
P(q \mid X) = \frac{P(X \mid q)\,P(q)}{\sum_{q'=1}^{Q} P(X \mid q')\,P(q')}
            = \frac{\phi(X \mid q)\,P(q)}{\sum_{q'=1}^{Q} \phi(X \mid q')\,P(q')}
$$

can be used to obtain the desired P(q | X) values. Here P(X | q) is the probability of X given
q, which is just φ(X | q). P(q), known as the a priori probability of category q occurring,
lies at the heart of Bayesian inference. P(q) is a measure of our knowledge about what forecast
category will occur before the predictor value X is obtained. The selection of appropriate P(q)
values is a task which falls on the user of the Bayesian strategy and is an extra computation
above those required for the maximum probability strategy.

If we were making a random forecast of Y(j), it would be reasonable (but not necessary) to
make the probability of randomly choosing category q proportional to the number of points of
the training set which fall in category q. So a reasonable choice of P(q) is

$$
P(q) = \frac{M_q}{N_{tr}}, \qquad q = 1, \ldots, Q
$$

where N_tr is the number of points in the training set. It should be understood that in making
this choice of P(q) we are allowing information about the relative distribution of points in
the category subsets to influence the PDM's forecast of the predictand when given a new
predictor value X'. This is the whole point of the Bayesian strategy. Another choice of P(q)
could lead to an entirely different forecast being made for the same X' value. If we wish to
make no use of our knowledge about the distribution of points in the category subsets, we can
pick P(q) = 1/Q for all q. This is the case of equally likely a priori distributions, for which
the Bayesian strategy reduces to the maximum probability strategy.
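The effect of the a priori probabilities can be seen in a small sketch. It assumes Gaussian pdf's with invented (mean, variance) parameters; only the subset sizes 3, 6, and 5 are taken from the Figure 3 example, and the function name `posterior` is hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def posterior(x_new, params, priors):
    """Return {q: P(q | X = x_new)} from Bayes' theorem."""
    weighted = {q: gaussian_pdf(x_new, *params[q]) * priors[q] for q in params}
    total = sum(weighted.values())
    return {q: w / total for q, w in weighted.items()}


# Hypothetical fitted (mean, variance) per category; counts M_q as in Figure 3.
params = {1: (1.2, 0.25), 2: (0.1, 0.16), 3: (-1.0, 0.25)}
counts = {1: 3, 2: 6, 3: 5}
n_train = sum(counts.values())
priors = {q: m / n_train for q, m in counts.items()}
uniform = {q: 1.0 / len(params) for q in params}

post = posterior(0.7, params, priors)
forecast_bayes = max(post, key=post.get)
post_uniform = posterior(0.7, params, uniform)
forecast_uniform = max(post_uniform, key=post_uniform.get)
```

With these made-up numbers the two choices of prior disagree at X' = 0.7: equal priors favor category 1, while priors proportional to subset size favor the more populous category 2.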


2.8.  Potential  Predictability 

The PDM as it now stands is ready to make predictions by whichever strategy is chosen in the
previous paragraph. However, it is of great interest also to compute some measure of confidence
in these predictions, i.e., to ascertain the expected forecast skill of the PDM. When the pdf's
φ_q(X) are not well separated, then the predictions have low skill, no matter what prediction
strategy we choose. Note, for example, in Figure 4 that for predictor values X' near 0.5 it is
nearly equally probable that the predictand is in category 1 or 2, if we use the maximum
probability strategy. Conversely, if the φ_q(X) are well separated, then the PDM has no
difficulty in determining which pdf has the maximum value for a given X', and we have greater
confidence that the predictions will be correct. Therefore a measure of our confidence in the
predictions can be obtained via a measure of how well separated are the pdf's. One measure of
this separation is given by the potential predictability index (PP). Note that this index is
distinctly different from prior uses of “potential predictability” in the literature, for
example, Madden and Shea [1978].

First define

$$
P'(i, q) \equiv \frac{\phi_q[X_{tr}(i, k)]}{\sum_{q'=1}^{Q} \phi_{q'}[X_{tr}(i, k)]}
$$

for i = 1, ..., N_tr, where q = 1, ..., Q, and k is held fixed. Here X_tr(i, k) are the
training set values of the predictor and N_tr is their number. Note that

$$
\sum_{q=1}^{Q} P'(i, q) = 1
$$

If the pdf's are identical, P'(i, q) = 1/Q. Thus a measure of how far the pdf's are from being
identical is

$$
\sum_{q=1}^{Q} \left[P'(i, q) - \frac{1}{Q}\right]^2
$$

Moreover, if the pdf's are perfectly separated, then

$$
\sum_{q=1}^{Q} \left[P'(i, q) - \frac{1}{Q}\right]^2 = \left[1 - \frac{1}{Q}\right]^2 + (Q - 1)\left[\frac{1}{Q}\right]^2
$$

where the first term on the right-hand side of the equation results from the single occurrence
of P'(i, q) = 1 in the sum, and the remaining terms on the right-hand side, Q - 1 in number,
result from P'(i, q) = 0. Therefore

$$
\sum_{q=1}^{Q} \left[P'(i, q) - \frac{1}{Q}\right]^2 = \frac{Q - 1}{Q}
$$

Thus we are led to define

$$
PP(i) \equiv \frac{Q}{Q - 1} \sum_{q=1}^{Q} \left[P'(i, q) - \frac{1}{Q}\right]^2
$$

Clearly, PP(i) = 1 if the pdf's are perfectly separated and PP(i) = 0 if the pdf's are
identical. Finally, we define the potential predictability, PP, as

$$
PP \equiv \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} PP(i)
$$

Thus PP has the property 0 ≤ PP ≤ 1 and is a measure of how distinct the pdf's are: PP
approaches zero as the pdf's become identical (and our confidence in a prediction decreases),
and PP approaches 1 as the pdf's become widely separated (and our confidence in a prediction
increases). This definition for PP is consistent with the choice of the maximum probability
strategy for making a forecast, as discussed in 2.7.1. If the Bayesian strategy of section
2.7.2 is chosen, the definition must be modified slightly by using

$$
P'(i, q) \equiv P[q \mid X = X_{tr}(i, k)]
         = \frac{\phi_q[X_{tr}(i, k)]\,P(q)}{\sum_{q'=1}^{Q} \phi_{q'}[X_{tr}(i, k)]\,P(q')}
$$

which reduces to the previous definition of P'(i, q) if the a priori distributions P(q) are
chosen to be equally likely, i.e., P(q) = 1/Q.

PP is implicitly indexed by k for the particular predictor X(j, k) in question. Moreover, PP
depends on the particular partition of the discriminant set into training and testing sets.
Thus one should make several (say Ω) random partitions of the discriminant set and compute PP
for each. Then, in the final tally the average PP (AVGPP) over all partitions should be taken:

$$
AVGPP(k) = \frac{1}{\Omega} \sum_{\omega=1}^{\Omega} PP(k, \omega)
$$

where we now explicitly show the predictor (k) and partition (ω) indices.
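The potential predictability index can be sketched as below, under the assumption that PP(i) is the sum of squared deviations of P'(i, q) from 1/Q, rescaled by Q/(Q - 1) so that it runs from 0 to 1. Gaussian pdf's stored as (mean, variance) pairs are assumed, the function name `potential_predictability` is hypothetical, and the two parameter sets are artificial limiting cases rather than fitted values.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def potential_predictability(x_train, params, priors=None):
    """Average PP(i) over the training predictor values.

    With priors=None the weights are the bare pdf values (maximum
    probability form); otherwise each pdf is multiplied by P(q).
    """
    n_cat = len(params)
    total = 0.0
    for x in x_train:
        weights = [gaussian_pdf(x, *params[q]) * (priors[q] if priors else 1.0)
                   for q in params]
        norm = sum(weights)
        spread = sum((w / norm - 1.0 / n_cat) ** 2 for w in weights)
        total += spread * n_cat / (n_cat - 1)
    return total / len(x_train)


# Limiting cases: identical pdf's and (numerically) perfectly separated pdf's.
identical = {1: (0.0, 1.0), 2: (0.0, 1.0), 3: (0.0, 1.0)}
separated = {1: (-100.0, 1.0), 2: (0.0, 1.0), 3: (100.0, 1.0)}
pp_low = potential_predictability([-1.0, 0.0, 2.0], identical)
pp_high = potential_predictability([-100.0, 0.0, 100.0], separated)
```

Averaging this quantity over several random training-testing partitions gives the AVGPP described above.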
1-class error score. Then define

$$
a_0 \equiv \frac{1}{N_{ts}}\,[\text{number of 0-class errors}]
$$

$$
a_1 \equiv \frac{1}{N_{ts}}\,[\text{number of 1-class errors}]
$$

where N_ts is the number of points in the testing set. Clearly, a_0 and a_1 satisfy

$$
0 \le a_0 \le 1
$$

$$
0 \le a_1 \le 1
$$

The larger a_0 is, the better the PDM has forecasted the testing set values, and the smaller
a_1 is, the better the PDM has performed. Unlike PP, d_0, and d_1, which are based on the
fitted pdf's defining the PDM model, a_0 and a_1 are actual forecast scores made by the PDM
when applied to an independent testing set. Our studies of the PDM in section 4 will make use
of the training and testing sets in the manner just discussed: the PDM will be defined using
the training set, and its performance will then be evaluated using the testing set. The a_0 and
a_1 scores are a convenient means of presenting forecast skill when discrete forecast
categories are used. (See, for example, Preisendorfer and Mobley [1984] for the use of a_0 and
a_1 in scoring seasonal climate forecasts.)
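The actual class-error scores can be computed by direct counting. The sketch below assumes both scores are normalized by the number of testing-set forecasts, which is what keeps them between 0 and 1; the function name `class_error_scores` and the sample categories are made up.

```python
def class_error_scores(forecast, observed):
    """Return (a_0, a_1): fractions of forecasts that are exactly right
    (0-class errors) and off by one category (1-class errors)."""
    n = len(observed)
    zero_class = sum(1 for f, o in zip(forecast, observed) if f == o)
    one_class = sum(1 for f, o in zip(forecast, observed) if abs(f - o) == 1)
    return zero_class / n, one_class / n


# Hypothetical testing-set forecasts and observed categories.
a0, a1 = class_error_scores([1, 2, 3, 3, 2, 1], [1, 2, 2, 1, 2, 3])
```

In a tercile categorization the remaining fraction, 1 - a_0 - a_1, is the share of forecasts that missed by two categories.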

2.11.  Significance  Tests  for  Class  Errors 

The Monte Carlo procedure, used in section 2.9 to determine the 5% critical value for
potential predictability, is equally applicable to the determination of critical values for
d_0, d_1, a_0, and a_1. For each of the 100 realizations of the random data set R, we can
compute d_0 and d_1 from the associated training set, and we can compute a_0 and a_1 scores
from the associated testing set. We then determine the 5% upper critical levels, d_0(96) and
a_0(96), and the 5% lower critical values, d_1(05) and a_1(05). Significantly good predictions
will have d_0 and a_0 scores that equal or exceed d_0(96) and a_0(96), respectively.
Significantly good predictions will have d_1 and a_1 scores that equal or are less than d_1(05)
and a_1(05), respectively. Note that when more than one predictor is considered (section 3.1),
estimation of significance level becomes more complicated.

2.12. Ranking and Screening Single Predictors

The net result of this section is the ability to individually rank (for a given predictand
Y(j)) the predictors X(j, k), k = 1, ..., K, in terms of their PP, d_0, d_1, a_0, and a_1
scores. Those predictors that have significant potential predictability and class-error scores
become candidates for further consideration in the multiple-predictor stage. Predictors that
have nonsignificant scores as single predictors of a predictand are unlikely to add useful
information if they are combined with other predictors in the multiple-predictor stage, and
they therefore can be dropped from further consideration. It is important to remember here that
as one considers more and more predictors, the probability of finding an apparently “good one”
by chance increases. In fact, if one considers K different predictors, then an appropriate 5%
critical level for any single predictor is (0.05)^K. Parsimony is obviously called for in the
original definition of the predictor pool.

There are obviously other methods of ranking the predictors than those outlined here.
Multiple-correlation analysis could be used in place of the simple correlation described
earlier to effect the ranking. Similarly, a redefinition of the predictors in terms of their
principal components and subsequent ranking by eigenvalue size represents a very different
approach to predictor ordering (compare section 4.2). Whatever method one uses, it is necessary
to avoid a large predictor pool or risk the chance of obtaining false results.

3.  The  Multiple-Predictor  Stage 

After performing the single-predictor, ordinary discriminant analyses of section 2 on each
predictor X(j, k), k = 1, ..., K, we have, for a fixed predictand Y(j), a set of predictors
ordered by their potential predictability scores. We drop from further consideration any
predictors which did not have statistically significant PP scores in the single-predictor
stage, so that K_1 ≤ K predictors remain. We now turn our attention to the task of constructing
the PDM model in its multivariate setting.

We choose the predictor with the highest potential predictability score as the first predictor
to be included in the multiple-predictor PDM model. We then must screen the remaining K_1 - 1
predictors in order to select those which, when combined with the first predictor, yield a
multiple-predictor model which is, in some sense, optimum.

3.1.  Correlational  Screening  of  Predictors 

Suppose we have already selected L - 1 predictors, L = 2, ..., K_1 - 1. Let these selected
predictors be X(j, l), l = 1, ..., L - 1. Let the remaining set of unselected predictors be
denoted by W(j, u), u = 1, ..., U; U + L - 1 = K_1. Let ρ[u, l] denote the correlation between
the indicated predictors. The number

$$
\rho_{\max}(u) = \max_{l = 1, \ldots, L-1} \left\{ \left| \rho[u, l] \right| \right\}
$$

is a measure of the distance between the uth unselected predictor W(j, u) and the set of L - 1
previously selected predictors X(j, l). The larger ρ_max(u) is, the closer W(j, u) is to
{X(j, l), l = 1, ..., L - 1} as a whole.

When choosing a new candidate predictor for addition to the previously selected predictors, we
choose that predictor W(j, u) which has the minimum correlation magnitude, ρ_max(u). In so
doing, we are selecting that predictor which is least correlated with the existing predictors
and therefore most likely to add new information to the model. If u' is the value of u giving
the minimum ρ_max(u), then we set X(j, L) = W(j, u'), j = 1, ..., N. This correlational
screening is the first step in choosing the Lth predictor. Whether or not this candidate
predictor is retained in the PDM model will depend on its effect on the PP, d_0, and d_1
scores, to be discussed in section 3.9.
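The correlational screening step can be sketched as a min-of-max search. The code assumes the ordinary (Pearson) correlation coefficient, uses hypothetical function names `correlation` and `next_candidate`, and works on short made-up series; it returns a zero-based index rather than the one-based u of the text.

```python
from math import sqrt
from statistics import fmean


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def next_candidate(selected, unselected):
    """Return the index of the unselected predictor with minimum rho_max."""
    def rho_max(w):
        return max(abs(correlation(w, x)) for x in selected)
    return min(range(len(unselected)), key=lambda u: rho_max(unselected[u]))


# One already-selected predictor and two made-up candidates.
selected = [[1.0, 2.0, 3.0, 4.0, 5.0]]
unselected = [[2.0, 4.1, 5.9, 8.0, 10.2],   # nearly collinear with the selected one
              [1.0, -1.0, 0.0, -1.0, 1.0]]  # uncorrelated with the selected one
u_prime = next_candidate(selected, unselected)
```

Taking the maximum over the selected set before minimizing ensures that a candidate is rejected if it duplicates any single predictor already in the model.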

3.2. The L-Dimensional Discriminant Set and Related Subsets

Having added a candidate Lth predictor, we now have a set of L predictors which at each time j
form a vector X(j) = [X(j, 1), X(j, 2), ..., X(j, L)] in Euclidean L-space E_L. As the time
index j varies, X(j) moves about in E_L. The category-valued predictand Y(j) concurrently
changes with j. The set of all ordered pairs [X(j), Y(j)], j = 1, ..., N, constitutes the
L-dimensional discriminant set.

The  L-dimensional  discriminant  set  is  randomly  split  into 
two  parts,  exactly  as  in  section  2.4.  The  result  is  a  set  of 
When comparing two possible predictors for a given predictand, the one with the higher AVGPP
will represent the higher predictability, on average.

2.9.  Monte  Carlo  Significance  Test  for  PP 

While one predictor may have a higher potential predictability than another, for a given
predictand, it is possible that neither is significant in the statistical sense. This
possibility can be tested via a Monte Carlo approach. Let a random number generator choose a
class q at each time j; define a new array R(j) = q, j = 1, ..., N, and replace Y by R (a
random version of Y). The probability of randomly assigning a particular q value to R(j) should
be made proportional to the relative frequency of occurrence of the qth category in the
Q-tiling of the original data set, so that the Monte Carlo test will simulate as closely as
possible the real experiment.

We can now use the given predictor set X(j, k) and the
newly defined random predictand R(j) to produce training
and testing sets, as in section 2.4, and to carry through all the
subsequent steps to obtain a value of PP. This entire process
can then be repeated, after generating a new realization of the
random predictand R, to obtain another value of PP for a
random relation between predictor and predictand. This process
can be repeated to generate, say, 100 values of PP for
random predictor-predictand connections. These 100 values
can be ordered from smallest to largest; call them PP(1) for
the smallest to PP(100) for the largest. The 5% critical value
for PP is then determined from the ninety-sixth smallest PP
value, PP(96). Thus the probability that a randomly produced
PP value will equal or exceed PP(96) is approximately 0.05.
Therefore if the PP value determined for the actual
predictor-predictand pair satisfies PP ≥ PP(96), we will say that
PP is significant at the 5% level.
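
The Monte Carlo procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the full chain of steps that produces PP (which is not reproduced here), and the function and parameter names are ours, not the paper's.

```python
import random


def monte_carlo_critical_value(score_fn, predictors, observed_classes,
                               n_trials=100, rank=96, seed=0):
    """Return the rank-th smallest score obtained with random predictands.

    score_fn(predictors, classes) -> float is a placeholder for the
    complete PP computation (hypothetical; supplied by the caller).
    """
    rng = random.Random(seed)
    categories = sorted(set(observed_classes))
    # Relative frequency of each category in the original data set.
    weights = [observed_classes.count(q) for q in categories]
    scores = []
    for _ in range(n_trials):
        # R(j): a random class at each time j, drawn with those frequencies.
        random_classes = rng.choices(categories, weights=weights,
                                     k=len(observed_classes))
        scores.append(score_fn(predictors, random_classes))
    scores.sort()
    return scores[rank - 1]  # PP(96) when n_trials=100 and rank=96


def is_significant(actual_score, critical_value):
    """True if the actual score equals or exceeds the critical value."""
    return actual_score >= critical_value
```

With `n_trials=100` and `rank=96` the returned value plays the role of PP(96); other choices of these two numbers give other significance levels.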

If one wants to establish a critical value for AVGPP(k), then
the Monte Carlo simulation is conducted so as to mimic the
generation of AVGPP(k), as described in section 2.8. Thus one
randomly produces Q realizations of PP(k, ω), finds their
average, and goes through this average-finding procedure 100
times in all. The ninety-sixth smallest randomly generated
AVGPP value then gives the 5% critical value for AVGPP.

We note also that there are other measures of separation of
the category swarms. For example, Hotelling's T² test (the
multivariate generalization of Student's t test) can be used to
test the significant separation of a pair of category means X̄_q.
However, such tests often depend on assumptions of normality
or independence of events. The potential predictability measure
of separation was developed in an attempt to have a
nonparametric test.

2.10.  Class  Errors 

The potential predictability gives us one measure of how
well a particular predictor can be expected to forecast predictand
values. Another straightforward indicator of how well a
prediction method is doing, when predicting categories, is to
count the number of predictions that are correct (0-class
errors) and the number of predictions that are off by one
category (1-class errors). In the PDM we shall do this two
ways: we will determine the potential 0- and 1-class errors, d_0
and d_1, respectively, using the training set, and we will determine
the actual 0- and 1-class errors, a_0 and a_1, using the
testing set.


2.10.1. Potential errors: d_0 and d_1. Recall the probabilities
P'(i, q), which were defined when developing the PP index
(using either the maximum probability or Bayesian strategies).
For each i value, find the maximum of the Q probabilities,
{P'(i, q): q = 1, ..., Q}, and let q'(i) be the q value for which
P'(i, q) is a maximum. We now define the potential 0-class
error as

$$ d_0 = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} P'[i, q'(i)] $$

where the sum runs over the N_tr points of the training set.
Note that as the pdfs become well separated, P'[i, q'(i)],
and consequently d_0, approach 1. As the pdfs become identical,
P'[i, q'(i)] and d_0 approach P(q'), which for the Bayesian
case is 1/Q. Therefore d_0 is another measure, based on the
pdfs Φ_q(X), of how confidently we can expect the PDM to
make a correct category forecast.

But even if the PDM makes an incorrect forecast, it is
clearly better to have a forecast that misses by only one category
than to have a forecast that misses by two or more
categories. For example, if category 1 is observed, a forecast of
category 2 is closer to the truth than is a forecast of category
3. Thus it is useful to have a measure of how likely it is that
the PDM will err by only one category, if it indeed makes an
incorrect forecast. Toward this end, we define

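$$
\begin{aligned}
\hat{P}(i, 1) &= 0 \\
\hat{P}(i, 2) &= P'(i, 1) \\
\hat{P}(i, 3) &= P'(i, 2) \\
&\;\;\vdots \\
\hat{P}(i, Q+1) &= P'(i, Q) \\
\hat{P}(i, Q+2) &= 0
\end{aligned}
$$

The idea here is to have P'[i, q'(i) - 1] = 0 if q'(i) = 1 and
P'[i, q'(i) + 1] = 0 if q'(i) = Q. Then define

$$ d_1 = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \left\{ \hat{P}[i, q'(i)] + \hat{P}[i, q'(i) + 2] \right\} $$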

A moment's reflection shows that d_1 is a measure of the probability
that a category one less or one greater than the correct
forecast category will be selected, if indeed the q'(i) value gives
a false forecast. As the pdfs Φ_q(X) become well separated, d_1
approaches 0; as the pdfs become identical, d_1 approaches
1/Q. Thus we have

$$ 0 \le d_1 \le 1/Q \le d_0 \le 1 $$

The larger d_0 is, the better X(j, k) may predict Y(j), and the
smaller d_1 is, the better X(j, k) may predict Y(j).
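
As a rough illustration of the two potential-error measures, the sketch below computes both from a table of training-set probabilities P'(i, q). It assumes that each measure is a plain average over the training points and that the off-by-one measure adds the probabilities of the two categories adjacent to the most probable one; the function name and the list-of-lists layout are our own choices.

```python
def potential_class_errors(p_prime):
    """Return (d0, d1) from p_prime, where p_prime[i][q] holds P'(i, q + 1).

    Each row is assumed to sum to 1 over the Q categories.
    """
    n = len(p_prime)
    d0 = 0.0
    d1 = 0.0
    for row in p_prime:
        # q'(i): category with the largest probability (0-based index here).
        q_best = max(range(len(row)), key=lambda q: row[q])
        # Pad with zeros so that neighbours beyond the end categories add 0.
        padded = [0.0] + list(row) + [0.0]
        d0 += row[q_best]
        # padded[q_best] is the category just below q'(i),
        # padded[q_best + 2] is the category just above it.
        d1 += padded[q_best] + padded[q_best + 2]
    return d0 / n, d1 / n
```

For perfectly separated pdfs every row has a single 1, so the first value is 1 and the second is 0.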

2.10.2. Actual errors: a_0 and a_1. After the PDM has been
constructed, or trained, using the training set [X_tr(i, k), Y_tr(i)],
we can apply the PDM to the testing set predictors, X_te(i, k),
and can verify the predictions it makes against the actual
observations for the testing set, Y_te(i). It is again crucial that
the members of the testing set be statistically independent
of the training set. Each time the PDM makes a correct
forecast, we tally one to the 0-class error score, and each time
the PDM forecast errs by one category, we tally one to the
1-class error score.

L-component vectors X_tr(i), i = 1, ..., N_tr, containing those
elements of X(j) randomly falling into the training set, and
another set of vectors X_te(i), i = 1, ..., N_te, containing the
remaining elements of X(j). The associated sets of predictands
Y_tr(i) and Y_te(i) are defined just as before.

We can now define subsets of E_L, the setting of the predictor
space, that are associated with each of the Q predictand categories.
The logic of this definition is the same as that of section
2.5. Thus we set C_q(i) = X_tr(i) if Y_tr(i) = q; the number of
points tallied to C_q is M_q.

It is to the subsets C_q, q = 1, ..., Q, of E_L that we will
eventually fit L-dimensional probability density functions.
However, before fitting the pdfs, we perform a preliminary
analysis of the L-dimensional category subsets using principal
component analysis (PCA). It is in this application of PCA
that the PDM parts company with classical discriminant
analysis.

3.3.  Binary  PCA  Decomposition  of  Category 
Subsets 

Let us consider, for didactic purposes, the case of two predictors
(L = 2) and a terciled predictand (Q = 3). Figure 5
shows three swarms of (artificially generated) points in E_2,
representing the three category subsets. In classical discriminant
analysis, each category subset would be fitted with a
bivariate normal pdf. For a point swarm shaped like that of
category 2, the bivariate normal pdf would probably be quite
satisfactory; Figure 6 shows the category 2 swarm and the
best fit binormal pdf. However, for an irregularly shaped
swarm, such as category 1 of Figure 5, the bivariate normal
pdf is clearly a poor representation of the actual shape of the
category subset. Figure 7 shows the best fit bivariate normal
pdf for category 1. Since discriminant methods depend upon
having pdfs which accurately delineate the category subsets,
we could not expect accurate forecasts from a model based on
fits as poor as that of Figure 7, and standard discriminant
analysis will fail.

Fig. 5. An illustration of three category swarms C_1 (pluses;
M_1 = 99 points), C_2 (triangles; M_2 = 89), and C_3 (circles; M_3 = 112)
in E_2.

Fig. 6. The category 2 point swarm of Figure 5 and the probability
contours of the best fit bivariate normal pdf.

Principal component analysis enables us to systematically
and objectively subdivide an arbitrarily shaped category
swarm into a number of smaller point swarms in E_L. If each of
the smaller swarms is then roughly elliptical in shape (in terms
of hyperellipses in E_L), then a multinormal pdf can be well
fitted to each smaller swarm. The critical need for parsimony
in this subdivision process should be kept in mind as the
reader proceeds through the next several sections. The pdf
describing the original, irregularly shaped category swarm can


Fig. 7. The category 1 point swarm of Figure 5 and the probability
contours of the best fit bivariate normal pdf.

both computationally expensive and overly strenuous, in the
sense that category swarms are subdivided just because they
are nonspherical. Swarms can deviate greatly from a spherical
shape and can still be adequately fit by a multivariate normal
pdf; it is the sinuous shapes (compare Figure 8) and multimodal
(or clustered) point swarms that need to be decomposed.

3.4.2. Strategy 2. One simple way to terminate the PCA
subdivision process is to force all initial category
swarms X_q to undergo a fixed number of subdivisions, say, to
level 2, as shown in Figure 9. This procedure seems to work
fairly well in practice, although it should not be applied blindly.
For instance, the category 2 swarm of Figure 5, which was
nearly spherical to begin with, seems little distorted by decomposing
it into, say, the four subswarms of a level 2 decomposition.
If X_q is sinuous, as are categories 1 and 3 of Figure 5,
then a level 2 decomposition goes a long way toward generating
a reasonable resolution of the original swarm, but without
getting too near the noise level.
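
A fixed-depth decomposition of this kind might be sketched as follows. The splitting rule used here is an assumption on our part: each swarm is divided by the sign of its score on the leading principal component, measured from the swarm centroid. The paper's own subdivision rule is defined elsewhere and may differ in detail, and the function names are hypothetical.

```python
import numpy as np


def split_along_first_pc(points):
    """Split a swarm in two by the sign of its leading PC score (assumed rule)."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    leading = eigvecs[:, -1]                 # direction of largest variance
    scores = centered @ leading
    return points[scores < 0], points[scores >= 0]


def decompose(points, levels=2):
    """Force a fixed number of binary subdivisions; return the terminal nodes."""
    nodes = [points]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            left, right = split_along_first_pc(node)
            # Keep the parent terminal if a child would have too few points
            # to support a nonsingular L x L covariance matrix.
            if min(len(left), len(right)) > node.shape[1]:
                next_nodes.extend([left, right])
            else:
                next_nodes.append(node)
        nodes = next_nodes
    return nodes


# Usage: a sinuous two-predictor swarm split to level 2 (four subswarms).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 200)
swarm = np.column_stack([x, np.sin(x) + 0.1 * rng.standard_normal(200)])
terminal_nodes = decompose(swarm, levels=2)
```

Every point of the original swarm ends up in exactly one terminal node, so the node sizes add up to the size of the swarm.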

3.4.3. Strategy 3. It is the sinuous shape of the data distribution
that causes the poor definition of pdfs and hence the
need for PDM subdivision. Thus we can envision measures of
the skewness and kurtosis of the data swarms that could be
used to decide if partitioning is required. It is clear that such
higher-moment measures of the data swarm will have to be
able to discern category 1 (Figure 5) distributions from elliptical
distributions, since the latter are well represented by
multidimensional Gaussians. We will not pursue such measures
here.

Fig. 10. The category 1 point swarm of Figure 5 and the probability
contours of Φ_1(X), as determined by a level 2 PCA decomposition.


function W_q(t) = N_q(t)/M_q, so that

$$ \sum_{t=1}^{NT_q} W_q(t) = 1 $$


3.5.  Fitting  pdf's  to  the  Terminal  Nodes 

Let us suppose that the qth category subset X_q has been
decomposed into a number of terminal nodes X_q(a_1, ..., a_n).
Let T_q(t) denote the tth terminal node X_q(a_1, ..., a_n) of X_q, and
let NT_q be the number of terminal nodes of X_q; t = 1, 2, ...,
NT_q. Thus NT_q = 1 for the case of no decomposition of the
original category subset, NT_q = 4 for a level 2 decomposition
like that of Figures 8 and 9, and so on. Let N_q(t) denote the
number of points N_q(a_1, ..., a_n) in the tth terminal node;

$$ \sum_{t=1}^{NT_q} N_q(t) = M_q $$

The centroid of T_q(t) is located at \bar{T}_q(t). Finally, let S_q(t) be the
L × L covariance matrix of T_q(t), with determinant ||S_q(t)||
and inverse S_q^{-1}(t).

The best fit multivariate normal pdf for the tth terminal
node T_q(t) is then

$$ \phi_q(t, X) = (2\pi)^{-L/2} \left( \| S_q(t) \| \right)^{-1/2} \exp\left\{ -0.5 \, [X - \bar{T}_q(t)]^T S_q^{-1}(t) \, [X - \bar{T}_q(t)] \right\} $$

(It is assumed that ||S_q(t)|| ≠ 0, so that S_q^{-1}(t) exists; if this is
not the case, the PCA decomposition leading to this terminal
node is not made, and the parent swarm is declared terminal.)
X is an arbitrary point in E_L. S_q^{-1}(t) is readily obtained from
the eigenvalues and eigenvectors obtained in the PCA of T_q(t),
namely,

$$ S_q^{-1}(t) = [N_q(t) - 1] \sum_{j=1}^{L} \frac{e_j e_j^T}{l_j} $$
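
The fit for a single terminal node can be illustrated as below. This sketch assumes L ≥ 2 and takes the eigenvalues of the covariance matrix itself (which `numpy.cov` already divides by N - 1), so no separate N - 1 factor appears; the function names are ours.

```python
import numpy as np


def fit_node(points):
    """Return centroid, eigenvalues and eigenvectors of the node covariance."""
    centroid = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)        # normalised by N - 1
    eigvals, eigvecs = np.linalg.eigh(cov)
    if eigvals.min() <= 0.0:
        # Singular covariance: the parent swarm should be kept terminal.
        raise ValueError("singular covariance matrix for this node")
    return centroid, eigvals, eigvecs


def node_pdf(x, centroid, eigvals, eigvecs):
    """Multivariate normal density at x, with the inverse covariance
    assembled as the sum over j of e_j e_j^T / lambda_j."""
    L = len(centroid)
    inv_cov = (eigvecs / eigvals) @ eigvecs.T  # column j scaled by 1/lambda_j
    diff = np.asarray(x, dtype=float) - centroid
    det = np.prod(eigvals)                     # determinant = product of eigenvalues
    norm = (2.0 * np.pi) ** (-L / 2.0) * det ** (-0.5)
    return norm * np.exp(-0.5 * diff @ inv_cov @ diff)
```

Building the inverse from the eigen-decomposition avoids a separate matrix inversion, since the eigenvalues and eigenvectors are already available from the PCA step.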

3.6. Assembling the pdf's

A multivariate normal pdf is fitted to each terminal node
T_q(t) of N_q(t) points, t = 1, ..., NT_q. We define a weighting


The probability distribution function for the qth category
subset is then taken to be

$$ \Phi_q(X) = \sum_{t=1}^{NT_q} W_q(t) \, \phi_q(t, X) $$

for q = 1, ..., Q and X in E_L. These pdfs Φ_q(X) define the
desired PDM model.
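
A category pdf built as a weighted sum of terminal-node Gaussians can be sketched as follows. The weight used here, the fraction of the category's points that fall in each node, is our assumption, chosen so that the weights sum to 1; the function names are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at x."""
    L = len(mean)
    diff = np.asarray(x, dtype=float) - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return (2.0 * np.pi) ** (-L / 2.0) * np.linalg.det(cov) ** (-0.5) * np.exp(-0.5 * quad)


def category_pdf(x, terminal_nodes):
    """Weighted sum of node pdfs; terminal_nodes is a list of (N_t, L) arrays."""
    total = sum(len(node) for node in terminal_nodes)   # points in the category
    value = 0.0
    for node in terminal_nodes:
        weight = len(node) / total                      # assumed weighting
        value += weight * gaussian_pdf(x, node.mean(axis=0),
                                       np.cov(node, rowvar=False))
    return value
```

With a single terminal node the function reduces to the ordinary best fit multivariate normal pdf of classical discriminant analysis.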

Figure 7 showed the binormal pdf for the category 1 point
swarm of Figure 5; this is the case of NT_q = 1, or no PCA
decomposition of the category set. Figure 10 shows the contours
of Φ_1(X) when determined by a level 2 decomposition, as
illustrated in Figures 8 and 9 and discussed in section 3.4.2.
This pdf is clearly a much more realistic description of the
category 1 swarm than is the pdf of Figure 7. If the PCA
decomposition is allowed to proceed until just before the minimum
point requirement N_q(t) > L is violated, the category 1
point swarm of Figure 5 is reduced to 23 terminal nodes.
Figure 11 shows the tree diagram of this maximum possible
decomposition. Figure 12 shows the Φ_1(X) contours determined
from the terminal nodes of Figure 11. This pdf gives a
very sharp delineation of the category subset, but the fine
structure of the probability contours is clearly being determined
by the individual points of the category subset, which
may be undesirable, as discussed in section 3.4.

3.7.  Making  a  Prediction 

Fig. 11. The tree diagram showing the maximum possible decomposition
of the category 1 subset of Figure 5. The circles represent the
X_1(a_1, ..., a_n) subsets, and the numbers within the circles give the
number of points in the subswarm, N_1(a_1, ..., a_n). Terminal nodes
T_1(t) are represented by boxes; the enclosed numbers give N_1(t).

Just as in the single-predictor case, we must choose a prediction
strategy (maximum probability, Bayesian, or another)
for using the pdfs Φ_q(X) to make a prediction. If the maximum
probability strategy is chosen, then, given a new predictor
realization X' (now an L-dimensional vector), we evaluate
Φ_q(X'), q = 1, ..., Q. The prediction is then that the predictand
falls into category q', where q' is the q value corresponding
to the maximum Φ_q(X'), q = 1, ..., Q. If the Bayesian
strategy is chosen, the a priori probabilities can be set to
P(q) = M_q/N_tr, as in the single-predictor case, and the pdfs
Φ_q(X) ≡ P(X | q) are used in Bayes' formula.
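
The two prediction strategies can be compared with a short sketch. It assumes that the Bayesian strategy selects the category with the largest posterior probability from Bayes' formula; the function name and return convention are ours.

```python
def predict_category(pdf_values, priors=None):
    """Return (category, posteriors) for a new predictor realization.

    pdf_values[q] is the pdf of category q + 1 evaluated at the new point.
    With priors=None the maximum probability strategy is used; otherwise
    each pdf value is weighted by its prior before normalisation.
    """
    if priors is None:
        weighted = list(pdf_values)
    else:
        weighted = [p * f for p, f in zip(priors, pdf_values)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("all category pdfs vanish at this point")
    posteriors = [w / total for w in weighted]
    best = max(range(len(posteriors)), key=lambda q: posteriors[q])
    return best + 1, posteriors   # categories are numbered from 1
```

When the priors strongly favour one category, as with a heavily populated "normal" class, the Bayesian choice can differ from the maximum probability choice for the same pdf values.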

3.8.  Potential  Predictability,  Class 
Errors,  and  Significance  Tests 

These matters all proceed in exact analogy to the
single-predictor case. Thus in computing the potential predictability
index for the maximum probability strategy, we first compute

$$ P'(i, q) = \Phi_q[X_{tr}(i)] \left\{ \sum_{r=1}^{Q} \Phi_r[X_{tr}(i)] \right\}^{-1} $$

for q = 1, ..., Q and i = 1, ..., N_tr. The only difference from
the single-predictor case is that we are now using the
L-dimensional training set values X_tr(i) in the multivariate pdfs
Φ_q(X). Subsequent formulas leading to PP or AVGPP are
unchanged. Likewise, the modifications required for the
Bayesian strategy are trivial.


Fig. 12. The category 1 points of Figure 5 and the Φ_1(X) probability
contours, as constructed from the 23 terminal nodes of Figure 11.


The potential predictability is now measuring the separation
of pdfs in an L-dimensional space. Figures 13-15 show
three sets of pdfs, as determined for the example point
swarms of Figure 5, where L = 2. Figure 13 (reproducing parts
of Figures 6 and 7) shows in superposition the contours of
equal probability of the three best fit binormal pdfs, Φ_q(X), as
would be obtained in classical discriminant theory. The potential
predictability for these pdfs is PP = 0.39, when using the
maximum probability forecast strategy. Figure 14 shows the
pdfs, Φ_q(X), as obtained by level 2 PCA, as illustrated in
Figures 8-10. The eye can now easily distinguish the three
pdfs determined from the three point swarms of Figure 5, and
the potential predictability has risen to PP = 0.77. Figure 15
shows the pdfs as determined from the maximum possible
PCA decomposition of the category swarms, as shown in Figures
11 and 12. These pdfs show even better separation; as


Fig. 13. Contours of equal probability of the three binormal pdfs
Φ_q(X), q = 1, 2, 3, fitting the three category subsets of Figure 5. The
contour interval is different for each of the three pdfs.

Fig. 17. Potential predictability values for various PDM models
(time lag τ = 0). The solid curves are for the maximum probability
forecast strategy, and the dashed curves are for the Bayesian strategy.
Dots are for no PCA decomposition of the category swarms (a level 0
decomposition, equivalent to classical discriminant analysis), triangles
are for a level 2 PCA decomposition, and squares identify the curves
for which the maximum possible number of PCA decompositions was
performed.


3.  All  else  being  equal,  PP  increases  as  the  number  of  PCA 
decompositions  of  the  category  swarms  increases. 

4.  All  else  being  equal,  PP  increases  as  more  predictors  are 
added  to  the  model. 

Similar results were found for d_1 and a_1, e.g., d_1 decreases (the
model becomes better) as predictors are added, all else being
equal, and so on. This behavior is consistent with our expectations
and with the high likelihood that much of the apparent
skill is artificial.

Figure 18 shows the dependence of PP on the time lag τ
between predictor and predictand, for the case of a Bayesian
forecast strategy and a level 2 PCA decomposition of category
swarms. We note that the PP scores decrease somewhat as τ
increases from 0 to 4 months for the two-predictor model, but
that the PP scores are relatively independent of τ for the
five-predictor model.

Figure 18 was generated for two-predictor and five-predictor
models in which the particular predictors in the
model were held fixed (i.e., predictors 5 and 6 in the first case
and predictors 5, 6, 7, 2, and 4 in the second case). In general,
we would expect that the best predictors for one time lag
might not be the best for another time lag. Indeed, for τ = 0 or
1, predictor 5 has the highest PP of any single predictor,
whereas for τ = 2, 3, or 4, predictor 2 (Barnett's U2) has the
highest PP. However, for the present data set this dependence
is weak: for τ = 4, predictor 2 has PP = 0.336 and predictor 5
has PP = 0.316.

Fig. 18. PP scores for the two-predictor (solid circles) and
five-predictor (open circles) PDM, as a function of time lag τ (months).

However, when the various PDM models of Figures 17 and
18 were applied to the testing set, the performance of the
PDM was quite disappointing. Indeed, the PDM's tercile category
forecasts failed even to show the presence of the 1982-1983
El Niño, let alone accurately predict its onset. Careful
investigation into the cause of the PDM's failure showed that
the raw data are so noisy that the category swarms cannot be
adequately distinguished: the points for the extreme categories
1 and 3 are nearly lost in the swarm of points for category 2.
The associated pdfs Φ_q(X) are correspondingly overlapping, a
result anticipated in point 1, cited earlier. Given such data,
neither the PDM nor any similar technique can be expected to
show any usable degree of forecast skill.

4.2.  Using  Filtered  Predictors 

If the poor performance of the PDM in the El Niño forecast
is indeed due to noise in the data, then perhaps filtering or
smoothing the raw predictor values will increase the
signal-to-noise ratio and thereby allow the PDM to extract the
information needed to make its forecast. To investigate this
possibility, a series of forecasts was made using two types of filters:

1. A seven-point running mean was applied to each predictor
time series. Thus each predictor value X(j, k), k = 1,
..., K, was replaced by a smoothed value, X_s(j, k), given by

$$ X_s(j, k) \equiv \frac{1}{7} \sum_{i=j-3}^{j+3} X(i, k) $$

The 3 months at the beginning and end of the 476-month time
series were left unsmoothed. The PDM analysis then proceeded
as before, but now using the X_s(j, k) as predictors.

2. As before, the training set X_tr was selected to be the first
N_tr = 396 months of each of the K = 7 predictors. A PCA was
then performed on the training set to get

$$ A = X_{tr} E $$

where E ≡ [e_1, ..., e_7] is the 7 × 7 matrix of empirical
orthogonal functions (EOFs) and A = [a_1, ..., a_7] is the
396 × 7 matrix of principal components. The principal component
time series a_j = [a_j(1), ..., a_j(N_tr)]^T, j = 1, ..., K, were
ordered by the size of their associated eigenvalue and were
used as the predictors in training the PDM, rather than using
the original X(j, k) as predictors (compare section 2.12). Since
the a_j are orthogonal, we can do no further predictor ranking
using correlations between predictors (compare section 3.1).

The testing set X_te was defined as before to be the predictors
from 1980 to 1986. However, before making a forecast using
the testing set, we replaced X_te by amplitudes A_te defined by

$$ A_{te} \equiv X_{te} E $$

where E is the EOF matrix of the training set. We thus performed
the same transformation on the training and testing
sets, so that the A_te values can be used in the probability
distribution functions of the PDM.
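
Both filters can be illustrated with a short sketch. It assumes the predictors are anomalies stored as a months-by-predictors array, so that no mean is removed before the eigen-analysis, and it uses synthetic random numbers in place of the wind data; the function names are ours.

```python
import numpy as np


def running_mean_7(X):
    """Seven-point running mean down each column; the first and last
    three rows are left unsmoothed."""
    X = np.asarray(X, dtype=float)
    smoothed = X.copy()
    for j in range(3, len(X) - 3):
        smoothed[j] = X[j - 3:j + 4].mean(axis=0)
    return smoothed


def eof_basis(X_train):
    """Columns are the EOFs of the training set, ordered by decreasing
    eigenvalue of X_train^T X_train (anomaly data assumed)."""
    eigvals, eigvecs = np.linalg.eigh(X_train.T @ X_train)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]


# Usage with synthetic data standing in for the 476 x 7 predictor array.
rng = np.random.default_rng(1)
X = rng.standard_normal((476, 7))
X_train, X_test = X[:396], X[396:]
E = eof_basis(X_train)
A_train = X_train @ E   # principal components of the training set
A_test = X_test @ E     # testing set projected on the same EOFs
```

Projecting the testing set onto the training-set EOFs, rather than recomputing EOFs from the testing data, keeps the two sets in the same coordinate system.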

A series of experiments was made to compare the forecasts
made using the filtered predictors with the forecasts made
using the raw data. The Bayesian forecast strategy and a level
2 decomposition of the category swarms were chosen. Figure
19 compares the a_0 scores of the various two-predictor
models. We note first that the a_0 scores obtained after applying
the seven-point running mean filter to predictors 5 and 6
are in fact lower than the scores obtained using unfiltered
predictors 5 and 6. However, if we perform a PCA and then
use principal components 1 and 2, the a_0 scores are generally
higher than the scores of the unfiltered two-predictor model
out to τ = 2. These results can be interpreted as follows. The
running mean is a low-pass temporal filter which leaves a
low-frequency, but possibly still random, time series. The spatial
correlations between the predictor time series are relatively
unchanged by the temporal smoothing. The PCA operation,
on the other hand, is a spatial filter, and the resulting
principal component time series a_1 contains spatially coherent
information from all of the original predictor regions. The
principal component time series a_2 also contains spatially
coherent information from all of the original predictor regions,
though of a spatial pattern which is distinct from that of a_1.
Thus merely filtering high-frequency noise from the
predictor time series does not improve the a_0 scores, whereas
using the spatially coherent signal from all of the original
predictor regions does lead to a better set of predictors, the
principal components a_1 and a_2. Figure 19 also shows that if
we first apply the seven-point running mean to each of the
original seven predictors and then perform a PCA on the
smoothed time series, we get greatly improved a_0 scores for
short time lags, although the a_0 scores are degraded for longer
time lags. This latter effect may simply be due to statistical
uncertainty in the estimation of the a_0.

Fig. 19. The a_0 scores for various two-predictor PDM models:
solid circles, unfiltered predictors 5 and 6; diamonds, predictors 5 and
6 with a seven-point running mean; open circles, principal components
1 and 2; squares, seven-point running mean, then PCA and
using principal components 1 and 2; triangles, persistence of the
predictand values.

For reference purposes, Figure 19 also shows the a_0 scores
obtained by persistence; that is, the observed category at time
j is used as the forecast for time j + τ. (For τ = 0, then, persistence
uses the observed category to forecast itself and obtains a
perfect score of a_0 = 1.) Since the SST anomaly categories, as
terciled, are quite persistent, persistence attains a high a_0
score. In a similar fashion, climatology, which always forecasts
tercile category 2, attains a score of a_0 = 0.725 owing to the
chosen terciling scheme. Neither persistence nor climatology
can forecast the onset of an El Niño, however, and thus they are
not valid competitors in actual forecast situations.
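
The reference scores can be reproduced in miniature as follows. This sketch assumes the scores are expressed as fractions of the verified forecasts, which is consistent with a perfect score of 1; the function names and the short category series are ours.

```python
def actual_class_scores(forecasts, observations):
    """Fractions of forecasts that are exactly right and off by one category."""
    pairs = list(zip(forecasts, observations))
    a0 = sum(1 for f, o in pairs if f == o) / len(pairs)
    a1 = sum(1 for f, o in pairs if abs(f - o) == 1) / len(pairs)
    return a0, a1


def persistence_forecast(observed, lag):
    """Pair the category observed at time j with the observation at j + lag."""
    if lag == 0:
        return list(observed), list(observed)
    return list(observed[:-lag]), list(observed[lag:])


def climatology_forecast(observed, category=2):
    """Always forecast the same category."""
    return [category] * len(observed), list(observed)


# Usage on a short made-up series of tercile categories.
observed = [2, 2, 3, 3, 2, 1, 2, 2]
persistence_scores = actual_class_scores(*persistence_forecast(observed, 1))
climatology_scores = actual_class_scores(*climatology_forecast(observed))
```

Both baselines score well whenever the middle category dominates and changes slowly, which is why a high score alone says little about skill at forecasting an onset.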


We also note that scores like a_0 are overall measures of a
forecaster's performance over the time span of the testing set.
If we are interested only in forecasting the onset of an El
Niño, then a low a_0 score does not necessarily imply poor
model performance, nor does a high a_0 score imply success in
the forecast.

Figure 20 shows the actual category forecasts made by the
smoothed two-predictor PDM model using principal components
a_1 and a_2 as predictors. We see that the PDM forecasts are
similar to that of the Barnett model for small τ: a rise to the
above-normal category, followed by a fall to the below-normal
category. But the PDM's longer-lead forecast missed the peak
of the event and also failed to predict the longevity of the
warming. Thus even though the preliminary PCA spatial filtering
of the noisy wind fields helped the model, it has not been able
to extract the same information from the original data set as did
the linear prediction model. In essence, the PP scores suggest
that there is so much variability between El Niño events that the
requisite pdfs are poorly defined, and so the PDM should fail.
Further, the 1982-1983 event was quite unusual for a variety
of reasons and so may not fit well into the statistical structure
determined from the training set. These problems notwithstanding,
we still should expect the PDM to fail in 1983 for the
same reason the linear prediction model failed [cf. Barnett,
1984].

In  summary,  the  PDM  did  not  perform  particularly  well  on 


Fig. 20. Category forecasts made using principal components 1
and 2 (circles of Figure 19). (a) The actual SST anomalies and the
forecast made by the Barnett model, with τ = 4 months. (b) The
observed tercile categories (a perfect forecast). Figures 20c-20g:
forecasts for time lags τ, as shown.

Fig. 16. The testing set for prediction of the 1983 El Niño. The light
curve and the scale at the right show the actual SST anomalies. The
heavy curve and the scale at the left show the corresponding tercile
category values.


4.  Application  of  the  PDM 

This section is intended to show some of the strengths and
weaknesses of the PDM. Thus we show a forecast scenario in
which the method does not do particularly well and one where
it apparently does better than other conventional forecast
schemes. The first example is particularly illuminating, for
there we intercompare results obtained using some of the different
strategies discussed earlier, thereby giving the reader a
feeling for the sensitivity of the PDM to the details of its
construction. Since this is essentially a theoretical paper, the
discussion of the applications is brief. Additional examples of
the PDM in operation are given by Preisendorfer et al. [1987].

4.1.  Forecasting  the  El  Nino  of  1982-1983 

Barnett [1984] addressed the problem of statistically forecasting
sea surface temperature (SST) anomalies in the equatorial
Pacific using wind anomalies as predictors, during the
1982-1983 El Niño. That study used an advanced regression
model which related the SST anomalies in the predictand regions
to the prior wind anomalies in the predictor regions.
The study showed, among other things, that it was possible to
forecast the onset of El Niño, as measured by SST anomalies
in a region off the coast of Peru, using wind anomalies from
various regions in the central Pacific. These forecasts were
successful at lead times of up to 4 months. Although the
model did an acceptable job of forecasting the onset of the
1982-1983 El Niño, it failed to accurately predict the decline
of the El Niño, for reasons discussed in the 1984 paper. It was
felt that a repetition of this study would be another means of
evaluating the PDM's forecast ability.

The  data  set  consists  of  monthly  wind  and  temperature 
anomalies  for  the  476  months  from  January  1947  to  August 
1986.  There  are  four  regions  of  the  equatorial  Pacific  for 
which  u-component  (east-west)  wind  anomalies  are  available, 
and  three  regions  for  which  there  are  v-component  (north- 
south)  wind  anomalies.  Thus  there  are  seven  possible  predic¬ 
tors  (labeled  1,  ■  ■  ■ ,  7  and  corresponding  to  Barnett's  f/l,  U2, 
1/3,  1/4,  PI,  P2,  and  V3,  respectively).  The  predictand  SST 
anomalies  were  terciled,  so  that  only  the  extreme  events  would 
fall  outside  the  “normar  category.  Inspection  of  the  SST 
record  shows  that  if  boundaries  B,  =  — 0.5°C  and  Bj  =  I.2X 
are  selected  (see  section  2.2),  then  slightly  less  than  one  sixth  of 
the  anomalies  fall  into  category  1  (below  normal  SST),  some¬ 
what  more  than  two  thirds  fall  into  category  2  (normal  SST), 
and  slightly  less  than  one  sixth  fall  into  category  3  (above 
normal  SST).  The  above-normal  category,  so  defined,  contains 
only anomalies which are greater than two standard deviations
from the mean, which is a reasonable definition of El
Niño. The 396 months from January 1947 to December 1979
were  taken  to  be  the  training  set,  and  the  80  months  from 
January  1980  to  August  1986  were  taken  to  be  the  testing  set. 
The  training  set  contains  several  El  Ninos,  so  we  thought  that 
the  PDM  should  have  a  good  opportunity  to  define  the  cat¬ 
egory  pdf’s.  The  1982-1983  event  stands  out  prominently  in 
the  testing  set,  as  is  seen  in  Figure  16.  Furthermore,  the  testing 
set  is  largely  independent  of  the  training  set,  although  there  is 
substantial  autocorrelation  within  each  set  (compare  section 
2.1 and section 2.4).
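
As an illustration of the categorization and the training/testing split just described, the following Python sketch assigns an SST anomaly to one of the three categories and divides a monthly record at December 1979. The function names are ours and do not come from the original study. The treatment of an anomaly falling exactly on a boundary is an assumption, since that convention belongs to section 2.2.

```python
from bisect import bisect_right

# Category boundaries quoted in the text (degrees C).
B1, B2 = -0.5, 1.2


def categorize(anomaly, boundaries=(B1, B2)):
    """Return 1 (below normal), 2 (normal), or 3 (above normal).

    Assumption: a value exactly on a boundary goes to the higher category.
    """
    return bisect_right(boundaries, anomaly) + 1


def month_count(start_year, start_month, end_year, end_month):
    """Number of months in an inclusive range of calendar months."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


def split_by_month(series, n_train):
    """Split a monthly series, kept in time order, into training and testing parts."""
    return series[:n_train], series[n_train:]


# January 1947 to December 1979 is the training period.
n_train = month_count(1947, 1, 1979, 12)
```

Splitting in time order, rather than at random, keeps the testing set after the training set, which is what makes the two sets largely independent despite the autocorrelation within each.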

The PDM was applied in various configurations:

1.  Both  maximum  probability  and  Bayesian  strategies 
were  used.  In  the  Bayesian  case  the  priors  were  made  pro¬ 
portional  to  the  number  of  points  in  the  category  (compare 
section  2.7). 

2.  Category  swarms  were  forced  to  undergo  a  predeter¬ 
mined  number  of  PCA  subdivisions,  either  zero  (as  seen  in 
Figure  13),  2  (as  seen  in  Figure  14),  or  the  maximum  possible 
number (as seen in Figure 15), as discussed in section 3.4.

3.  The  potential  predictability  was  used  to  measure  the 
separation of the category pdf's, although the 5% significance
levels were computed only in the single-predictor cases (owing
to computational expense).

4.  The  individual  predictors  were  rated  by  their  potential 
predictability  scores  in  order  to  select  the  first  predictor.  Sub¬ 
sequent  predictors  were  added  to  the  model  in  the  order  given 
by  the  correlations,  as  described  in  section  3.1.  Models  con¬ 
taining  1-7  predictors  were  compared. 
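
The difference between the two forecast strategies of item 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: we take "maximum probability" to mean choosing the category whose pdf is largest at the new predictor value, and "Bayesian" to mean weighting each pdf value by a prior proportional to the category's training count before choosing. The function names, the pdf values, and the category counts below are hypothetical and are not taken from the study.

```python
def priors_from_counts(counts):
    """Priors proportional to the number of training points in each category."""
    total = sum(counts)
    return [c / total for c in counts]


def forecast_category(pdf_values, priors=None):
    """Return the forecast category (1-based).

    pdf_values: value of each category pdf at the new predictor point.
    priors: None for the maximum probability strategy, or one weight per
            category for the Bayesian strategy.
    """
    if priors is None:
        priors = [1.0] * len(pdf_values)
    weighted = [p * w for p, w in zip(pdf_values, priors)]
    best = max(range(len(weighted)), key=weighted.__getitem__)
    return best + 1


# Hypothetical pdf values at one predictor point, and hypothetical counts
# in roughly the 1/6, 2/3, 1/6 proportions described in the text.
pdf_at_x = [0.10, 0.20, 0.50]
counts = [60, 276, 60]

max_prob_choice = forecast_category(pdf_at_x)
bayes_choice = forecast_category(pdf_at_x, priors_from_counts(counts))
```

With these made-up numbers the maximum probability strategy selects category 3, while the Bayesian strategy selects the heavily populated category 2, showing how the priors pull forecasts toward the "normal" category.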

For a time lag of τ = 0, predictor 5 (wind in region V1) has
the highest potential predictability score of any individual predictor.
If the maximum probability strategy is chosen, this
value is PP = 0.196; the 5% significance level is
PP(95) = 0.019, so that PP is significant. For the Bayesian
strategy, PP = 0.377 and PP(95) = 0.316, so that PP is once
again significant. Predictor 5 thus becomes the first predictor
of the PDM model. Predictor 6 (wind in region V2) is least
correlated with predictor 5, and therefore becomes the next
predictor  added  to  the  model.  With  two  or  more  predictors  in 
the  model,  we  also  have  the  possibility  of  forecast  skills  de¬ 
pending  on  the  number  of  PCA  decompositions  of  the  cat¬ 
egory  sets.  Figure  17  shows  the  dependence  of  the  potential 
predictability on the form of the PDM model. In Figure 17 we
note the following behavior of the potential predictability:

1.  The  relatively  low  initial  PP  values,  while  significant, 
indicate  that  the  category  pdf’s  are  not  very  distinct.  We  im¬ 
mediately  expect  that  the  PDM,  as  constituted  for  this  prob¬ 
lem,  will  not  perform  well. 

2.  All  else  being  equal,  PP  is  greater  for  the  Bayesian  fore¬ 
cast  strategy  than  for  the  maximum  probability  strategy. 

Fig. 14. Contours of equal probability of the three pdf's, as
determined from a level 2 PCA decomposition of each category
subset of Figure 5. Contour intervals vary.


verified  by  their  PP  value  of  PP  =  0.87,  but  the  noise  in  the 
data  (i.e.,  the  positions  of  the  individual  points)  has  clearly 
affected the pdf's themselves. If the pdf's of Figure 15 were
used  for  actual  forecasting,  it  might  often  occur  that  predictor 
values  X'  would  “fall  into  the  gaps”  of  these  irregularly  shaped 
pdf's  in  such  a  manner  as  to  cause  the  point  to  be  ascribed  to 
the  wrong  pdf,  thus  giving  an  incorrect  forecast. 

Given the probabilities F(i, q), the potential class errors $d_0$
and $d_1$ are immediately available. The actual class errors $a_0$
and $a_1$ are now computed from the multipredictor testing set
$X_t(i)$, $i = 1, \ldots, N_t$.

Fig. 15. Contours of equal probability of the three pdf's, as
determined from the maximum possible PCA decomposition of the
category subsets of Figure 5. Contour intervals vary.

Monte Carlo experiments for determining 5% significance
levels on PP, $d_0$, $d_1$, $a_0$, and $a_1$ proceed, in principle, as before.
Now,  however,  when  the  randomly  generated  predictand  /I(i) 
is  analysed  using  the  multivariate  predictors,  it  is  necessary  to 
perform  a  full  PCA  decomposition  in  order  to  get  the  needed 
pdf's  (as  described  in  sections  3.3  to  3.6).  This  PCA  analysis 
becomes  prohibitively  expensive  when  it  must  be  repeated  100 
times  in  a  Monte  Carlo  experiment.  Thus  in  practice,  the  5% 
significance  levels  may  not  be  available. 
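
The structure of such a Monte Carlo experiment, and the reason for its cost, can be sketched as follows. This is a minimal illustration and not the original implementation. The argument `score_fn` is a hypothetical stand-in for the full computation of the statistic being tested (for the multipredictor PDM it would include the PCA decomposition, which is why every trial is expensive). Generating the random predictand by shuffling the observed one is our assumption, and it destroys any autocorrelation. The default of 100 trials follows the count quoted above.

```python
import random


def monte_carlo_critical_value(score_fn, predictors, predictand,
                               n_trials=100, level_percent=95, seed=0):
    """Empirical critical value of a score under a random predictand.

    score_fn(predictors, predictand) -> float is called once per trial,
    so the total cost is n_trials times the cost of one full model fit.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = list(predictand)   # copy, so the input is left untouched
        rng.shuffle(shuffled)         # assumption: random predictand by shuffling
        scores.append(score_fn(predictors, shuffled))
    scores.sort()
    # Index of the score at or below which level_percent of trials fall.
    index = min(n_trials - 1, (level_percent * n_trials) // 100)
    return scores[index]
```

A score computed from the real predictand would then be called significant at the 5% level if it exceeds the value returned with `level_percent=95`; for a score where small values are good, the lower tail (`level_percent=5`) would be used instead.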

3.9. Final Screening of the Candidate Predictor

We recall from section 3.1 that we have admitted a candidate
Lth predictor to the PDM model, based upon the correlation
screening described there. We now may use the information
gathered in the previous paragraph to decide whether
or not to keep the candidate predictor in the model. Let $d_0(L - 1)$
and $d_1(L - 1)$ denote the $d_0$ and $d_1$ scores obtained from
the PDM model before the candidate Lth predictor was admitted
(if L = 2, we have the single predictor potential class
errors available). Let PP(L), $d_0(L)$, and $d_1(L)$ be the scores
obtained after the candidate Lth predictor was admitted.
Moreover, let PP(95; L), $d_0(95; L)$, and $d_1(05; L)$ be the appropriate
5% critical values, as determined by Monte Carlo simulations.
We then accept the candidate Lth predictor, X(j, L),
into the PDM model, if the following conditions hold:

Condition 1

$$\mathrm{PP}(L) \geq \mathrm{PP}(95; L)$$

Condition 2

$$d_0(L) > d_0(L - 1), \qquad d_1(L) \leq d_1(L - 1)$$

Condition 3

$$d_0(L) \geq d_0(95; L), \qquad d_1(L) \leq d_1(05; L)$$

If  these  three  conditions  are  not  satisfied,  we  delete  the  candi¬ 
date  predictor  from  the  model  and  return  to  section  3.1  to 
select  the  next  candidate  predictor.  We  continue  in  this 
manner  until  all  possible  predictors  have  been  examined,  at 
which  time  the  PDM  model  is  complete. 

Condition  1  is  simply  the  requirement  that  the  model  have 
a  statistically  significant  potential  predictability.  Condition  2 
is  the  requirement  that  the  addition  of  the  Lth  predictor  im¬ 
prove  the  potential  class  error  scores,  and  condition  3  ex¬ 
presses  the  requirement  that  the  model’s  potential  class  error 
scores  be  statistically  significant.  Conditions  1  and  3  can  be 
relaxed  by  using,  say,  a  10%  significance  level  instead  of  the 
5%  level  shown.  Condition  2  cannot  be  relaxed.  For  complete 
rigor  the  critical  level  should  decrease  as  the  number  of  possi¬ 
ble  predictors  is  increased.  This  allows  for  the  probability  that 
one of the predictors will, by sheer chance, appear useful
(compare  section  2.12). 
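
The screening rule of this section can be written as a short Python function. This is a sketch under stated assumptions: the inequality directions follow our reading of conditions 1 to 3 (PP and d0 are to be large, d1 is to be small), the function and argument names are ours, and the numerical values in the usage example are hypothetical.

```python
def accept_candidate(pp, pp_crit,
                     d0, d0_prev, d0_crit,
                     d1, d1_prev, d1_crit):
    """Return True if the candidate Lth predictor is kept in the model.

    pp, d0, d1          : scores with the candidate admitted (L predictors)
    d0_prev, d1_prev    : scores of the model with L - 1 predictors
    pp_crit, d0_crit    : 95% Monte Carlo critical values
    d1_crit             : 05% Monte Carlo critical value
    """
    # Condition 1: potential predictability is statistically significant.
    cond1 = pp >= pp_crit
    # Condition 2: the new predictor improves the potential class errors.
    cond2 = d0 > d0_prev and d1 <= d1_prev
    # Condition 3: the potential class errors are statistically significant.
    cond3 = d0 >= d0_crit and d1 <= d1_crit
    return cond1 and cond2 and cond3


# Hypothetical scores for a candidate second predictor.
keep = accept_candidate(pp=0.40, pp_crit=0.30,
                        d0=0.62, d0_prev=0.55, d0_crit=0.50,
                        d1=0.30, d1_prev=0.35, d1_crit=0.40)
```

Relaxing conditions 1 and 3 to a 10% level, as discussed above, would change only the critical values passed in; condition 2 involves no critical value and so is unaffected.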


3.10.  Scoring  the  PDM  Model 
Once  the  PDM  model  is  complete,  we  can  compute  the 
actual class errors $a_0$ and $a_1$, using the testing set $X_t(i)$, $i = 1$,
..., $N_t$, generated during the examination of the final predictor
which was admitted to the model. These $a_0$ and $a_1$ scores,
together  with  the  information  shown  in  conditions  1,  2,  and  3 
in  section  3.9,  are  the  data  by  which  we  measure  the  PDM 
model’s  actual  and  potential  skills. 

Fig. 23. Percentage $a_0$ scores obtained from Monte Carlo experiments in which the category forecast was made at
random. The expected value is 33. The numbers in parentheses show the scores of the two-predictor PDM, from Figure
22.


4.  Forecasts  were  made  for  winter  at  a  lead  time  of  one 
season,  e.g.,  fall  SLP  predicting  winter  temperature.  The  scores 
are  shown  as  “percent  correct  category  forecasts”  and  thus  a 
randomly  made  forecast  has  an  expected  value  of  33%.  Note 
that  these  are  actual  forecast  skills,  since  the  testing  sets  in  no 
way  entered  the  predictor  screening  or  PDM  pdf  construction. 
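
The 33% baseline can be checked with a small simulation of the kind summarized in Figure 23. The Python sketch below is ours and is only illustrative: it draws each category forecast uniformly at random from the three categories, which is our assumption about how "at random" was implemented, and the observed series used in the example is made up.

```python
import random


def percent_correct(forecast, observed):
    """Percentage of forecasts that fall in the observed category."""
    hits = sum(f == o for f, o in zip(forecast, observed))
    return 100.0 * hits / len(observed)


def random_forecast_scores(observed, n_trials=1000, n_categories=3, seed=0):
    """Percent-correct scores of forecasts drawn uniformly at random."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        forecast = [rng.randint(1, n_categories) for _ in observed]
        scores.append(percent_correct(forecast, observed))
    return scores


# Made-up observed categories; the baseline does not depend on their mix.
observed = [1] * 50 + [2] * 200 + [3] * 50
scores = random_forecast_scores(observed)
mean_score = sum(scores) / len(scores)
```

The mean of the simulated scores lies close to 100/3, and the spread of the individual scores indicates how far above 33% a real forecast must score before it can be distinguished from chance.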

The  results  of  the  single-predictor  experiments  are  shown  in 
Figure 21. Monte Carlo simulations showing $a_0$ forecast skill
values averaged over the entire United States in excess of 50%